## Collaborative MARL for AutoML Pipeline Construction

EasyML is a research artifact implementing a Teacher–Student Multi-Agent Reinforcement Learning (MARL) framework for automated machine learning (AutoML) pipeline synthesis. This README is intentionally scrubbed of identifying metadata (repository owner names, emails) for double‑blind review.

## Overview

The framework operationalizes pedagogical guidance for pipeline construction: a student agent proposes sequential components; a teacher agent performs selective, counterfactual interventions only when estimated advantage exceeds a dynamic threshold. Component-level credit assignment provides interpretability, transfer readiness, and structural pruning without resorting to opaque large language model (LLM) orchestration.
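
The selective intervention rule can be sketched in a few lines. This is a minimal illustration, not the artifact's actual code: the class name `TeacherGate`, the exponential decay schedule, and the default values are all assumptions. The description above fixes only that the teacher overrides when the estimated advantage exceeds a dynamic threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TeacherGate:
    """Hypothetical override gate; names and defaults are illustrative only."""

    start: float = 0.5    # initial threshold (assumed value)
    floor: float = 0.05   # lower bound on the threshold (assumed value)
    decay: float = 0.99   # per-episode multiplicative decay (assumed schedule)

    def threshold(self, episode: int) -> float:
        # Exponential decay from `start` toward `floor`.
        return self.floor + (self.start - self.floor) * self.decay ** episode

    def decide(
        self, q_values: dict[str, float], student_action: str, episode: int
    ) -> tuple[str, bool]:
        """Return (action, overridden) given the teacher's value estimates."""
        best_action = max(q_values, key=q_values.get)
        # Counterfactual advantage: estimated gain from replacing the student's choice.
        advantage = q_values[best_action] - q_values[student_action]
        if advantage > self.threshold(episode):
            return best_action, True
        return student_action, False


gate = TeacherGate()
q = {"scaler": 0.62, "pca": 0.30, "END": 0.10}  # made-up value estimates
print(gate.decide(q, student_action="pca", episode=0))    # advantage 0.32 is below 0.5
print(gate.decide(q, student_action="pca", episode=200))  # threshold has decayed below 0.32
```

Keeping the gate separate from the value estimator means every override can be logged together with the advantage and threshold that triggered it.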

## Why Not LLM Orchestration?

LLM-based multi-agent AutoML systems introduce prompt sensitivity, higher latency, weaker reproducibility, and limited grounded attribution. This framework offers:
- Deterministic seeds and auditable intervention logs
- Bounded action branching via feasibility masks
- Numeric credit vectors (marginal / historical) instead of free-form rationales
- Lower computational and monetary cost (compact value networks)
- Pluggable domain operators without prompt engineering

## Key Features

- Asymmetric Teacher–Student MARL with selective counterfactual intervention
- Structured, component-level credit assignment (sparse ablation + historical reuse)
- Knowledge transfer (zero-shot, partial fine-tune, joint fine-tune)
- Emergent curriculum (guidance attenuation over time)
- Sample efficiency via early pruning of low-yield branches
- Interpretability: override logs, credit vectors, failure counters
- Modular component registry (easy domain specialization)

## Installation

- Unzip `submission.zip`, then run the following from the directory containing `requirements.txt`:

```bash
python -m venv env
source env/bin/activate  # Windows: env\Scripts\activate
pip install -r requirements.txt
```

## Running Core MARL Training

```bash
# Standard run
python -m marl.train --dataset iris --episodes 300

# Larger dataset example
python -m marl.train --dataset adult --episodes 500 --eval-timeout 300

# Deterministic seed
python -m marl.train --dataset bank-marketing --episodes 400 --seed 42
```

### Available Datasets
- iris
- adult
- covertype
- credit-g
- bank-marketing

(Ensure dataset loader paths match names above; loaders reside in the experiment utilities.)

### Quick vs Full Runs
```bash
# Fast smoke test
python -m marl.train --dataset iris --episodes 80
# Closer to reported scale
python -m marl.train --dataset adult --episodes 500
```

## Benchmarking

Scripts compare MARL against classical baselines under unified timeout/evaluation budgets.
```bash
# All benchmarks (the --dataset flag is an example; use it only if the script supports it)
python -m experiments.benchmarks.benchmark_all --dataset adult --eval-timeout 300

# Individual baselines (adjust max-evals/time budgets)
python -m experiments.baselines.random_search --dataset adult --max-evals 200
python -m experiments.baselines.grid_search --dataset adult --max-evals 200
python -m experiments.baselines.tpot_baseline --dataset adult --time-budget 1800
```
The Auto-sklearn baseline is deprecated because of instability and environment constraints, and it is intentionally excluded from the default tables. H2O AutoML, if added, can be run separately; it is not shown here when the dependency is omitted for anonymity.

### (Illustrative) Benchmark Result Snapshot (Adult)
| Method | Accuracy | F1 | Time (s) |
|--------|---------:|---:|---------:|
| Grid Search | 85.74% | 85.62% | 118.2 |
| Random Search | 85.68% | 85.55% | 42.5 |
| TPOT | 85.97% | 85.83% | 452.8 |
| MARL AutoML | 86.20% | 86.15% | 215.6 |

## Knowledge Transfer
```bash
# Source pre-training then adaptation
python -m experiments.transfer.knowledge_transfer \
  --source iris --target adult \
  --source_episodes 200 --target_episodes 60 --eval-timeout 300
```
Modes (if implemented): zero-shot eval, freeze-teacher fine-tune, joint fine-tune.

## Ablation (Preliminary / Qualitative)
Legacy numeric ablation results were invalidated by an environment refactor. The following modes currently support qualitative comparison only:
```bash
python -m experiments.ablation.run_ablation --dataset adult --episodes 60 --mode no_teacher
python -m experiments.ablation.run_ablation --dataset adult --episodes 60 --mode no_credit
python -m experiments.ablation.run_ablation --dataset adult --episodes 60 --mode no_adaptive_exploration
```

## Quick Sanity Test
```bash
python -m experiments.test --dataset adult --eval-timeout 300
```

## Interpretability Artifacts
During and after training, the following artifacts are emitted:
- Learning curves (reward + validation trajectory)
- Intervention rate plots
- Pipeline evolution visualization (length vs performance)
- Teacher contribution breakdown
- Credit distribution traces
- JSON logs: episode metrics, invalid/exception counters, override records

## Framework Internals (Appendix Style Summary)
- Environment: feasibility masks + timeout-guarded evaluation
- Student: Double DQN over valid component actions (+ END)
- Teacher: Advantage-based override meta-policy with decaying threshold
- Credit: Sparse ablations + historical reuse → normalized weights guiding replay priority
- Transfer: Parameter reuse with adjustable exploration warm-start
- Failure Accounting: Distinct counters (timeout, incompatibility, exception) for auditability
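
The credit step in the list above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `sparse_ablation_credit`, the linear blend of marginal and historical credit, the clipping of negative credit, and the requirement that component names within a pipeline are unique. The summary states only that sparse ablations and historical reuse produce normalized weights.

```python
from __future__ import annotations

import random
from typing import Callable, Sequence


def sparse_ablation_credit(
    pipeline: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    history: dict[str, float],
    n_ablations: int = 2,
    blend: float = 0.5,
    rng: random.Random | None = None,
) -> dict[str, float]:
    """Blend marginal (ablation) and historical credit, then normalize to sum to 1."""
    if not pipeline:
        return {}
    rng = rng or random.Random(0)
    full_score = evaluate(pipeline)
    # Sparse: ablate only a random subset of positions to limit re-evaluations.
    positions = set(rng.sample(range(len(pipeline)), k=min(n_ablations, len(pipeline))))

    raw: dict[str, float] = {}
    for i, name in enumerate(pipeline):
        prior = history.get(name, 0.0)
        if i in positions:
            reduced = [c for j, c in enumerate(pipeline) if j != i]
            marginal = full_score - evaluate(reduced)
            raw[name] = blend * marginal + (1.0 - blend) * prior
        else:
            # Not ablated this time: reuse the historical estimate unchanged.
            raw[name] = prior

    # Clip negative credit to zero so the result is a valid weight vector.
    clipped = {k: max(v, 0.0) for k, v in raw.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {k: 1.0 / len(clipped) for k in clipped}
    return {k: v / total for k, v in clipped.items()}


def toy_evaluate(p: Sequence[str]) -> float:
    # Stand-in scorer with made-up additive gains; not a real evaluation.
    gains = {"imputer": 0.05, "scaler": 0.10, "rf": 0.60}
    return 0.20 + sum(gains[c] for c in p)


weights = sparse_ablation_credit(
    ["imputer", "scaler", "rf"],
    toy_evaluate,
    history={"imputer": 0.04, "scaler": 0.08, "rf": 0.50},
    n_ablations=3,
)
print(weights)
```

Weights of this kind could then scale replay priorities, so that transitions involving high-credit components are sampled more often.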

## Reproducibility Notes
- Unified evaluation timeout (default 300s) propagated through experiment scripts
- Seeds set for Python, NumPy, Torch; residual nondeterminism may remain in parallelized estimators
- Remove or constrain thread pools (e.g., OMP_NUM_THREADS) for stricter determinism if required
- Multi-seed benchmarking recommended (≥3) for publication-quality variance estimates
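
A seeding helper along the lines of the notes above might look like the following sketch. The function name `set_global_seed` is hypothetical, and NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed. It does not remove nondeterminism inside parallelized estimators.

```python
import os
import random

# Thread-pool limits are typically read when a numerical library is first loaded,
# so set them before importing NumPy or Torch.
os.environ.setdefault("OMP_NUM_THREADS", "1")


def set_global_seed(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch (the latter two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    # Prefer reproducible cuDNN kernels over autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_global_seed(42)
first = random.random()
set_global_seed(42)
print(first == random.random())  # same seed, same draw
```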

## Extending the Component Library
1. Implement fit/transform or fit/predict adapter
2. Declare compatibility metadata (input types, output schema)
3. Register in the component registry module; feasibility mask updates automatically

Credit and intervention logic pick up new components immediately; no architecture change is required.
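
The three steps above can be pictured with a toy registry. This is a sketch under stated assumptions, not the artifact's registry module: `ComponentSpec`, `REGISTRY`, `register`, `feasibility_mask`, and the data-type labels are hypothetical names, and the END action is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSpec:
    """Hypothetical compatibility metadata for one pipeline component."""

    name: str
    accepts: frozenset[str]  # data states the component can consume
    produces: str            # data state it emits


REGISTRY: dict[str, ComponentSpec] = {}


def register(spec: ComponentSpec) -> ComponentSpec:
    REGISTRY[spec.name] = spec
    return spec


def feasibility_mask(current_state: str) -> dict[str, bool]:
    # Derived from the registry on every call, so new components appear automatically.
    return {name: current_state in spec.accepts for name, spec in REGISTRY.items()}


register(ComponentSpec("imputer", frozenset({"with_missing"}), "dense"))
register(ComponentSpec("scaler", frozenset({"dense"}), "dense"))
register(ComponentSpec("logreg", frozenset({"dense"}), "predictions"))
print(feasibility_mask("with_missing"))  # only the imputer is feasible

# Adding a component requires no change to the mask logic.
register(ComponentSpec("pca", frozenset({"dense"}), "dense"))
print(feasibility_mask("dense"))
```

Deriving the mask from declared metadata, rather than hard-coding valid transitions, is what lets a newly registered component participate without further changes.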

## Comparison to LLM Pipelines (Summary)
| Aspect | This Framework | LLM Orchestration |
|--------|----------------|-------------------|
| Attribution | Numeric credit + advantage logs | Textual rationales (non-binding) |
| Reproducibility | Seeded, bounded action set | Prompt / decoding sensitive |
| Cost / Latency | Lightweight value nets | High token inference cost |
| Search Control | Explicit masks + gating | Indirect via prompts |
| Domain Injection | Add component class | Prompt / fine-tune LLM |
| Transfer | Structural priors reuse | Re-derive heuristics |

## Common Issues
- Memory pressure (large data): reduce batch size / episodes
- Slow progress: check intervention rate; adjust threshold or episodes
- Baseline tool failure: disable optional baseline flags; MARL unaffected
- Variance across runs: average multiple seeds
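
For the last item, a minimal aggregation sketch is shown below. The helper `summarize_runs` is hypothetical, and the scores are placeholder numbers rather than results from this framework; in practice they would come from separate runs launched with different `--seed` values.

```python
import statistics


def summarize_runs(scores_by_seed: dict) -> dict:
    """Mean and sample standard deviation of one metric across seeds."""
    values = list(scores_by_seed.values())
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n": len(values), "mean": statistics.mean(values), "std": std}


# Placeholder scores keyed by seed (not measured results).
summary = summarize_runs({0: 0.80, 1: 0.84, 2: 0.82})
print(f"{summary['mean']:.3f} +/- {summary['std']:.3f} over {summary['n']} seeds")
```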

